Skip to content

Add DataDome server-side validation and multi-backend routing#470

Open
jevansnyc wants to merge 1 commit intomainfrom
arena-vcl-replacement
Open

Add DataDome server-side validation and multi-backend routing#470
jevansnyc wants to merge 1 commit intomainfrom
arena-vcl-replacement

Conversation

@jevansnyc
Copy link
Collaborator

@jevansnyc jevansnyc commented Mar 11, 2026

Adds two complementary features for edge-level request control:

DataDome server-side validation:

  • New server_side_enabled config flag on DataDome integration
  • validate_request_server_side() calls the DataDome HTTP API at the edge before proxying to origin, blocking detected bots with 403
  • Configurable: API key, validation endpoint, timeout, fail-open behavior, and sample rate for gradual rollout
  • Integration cached in OnceLock -- built once, reused across requests

Multi-backend routing:

  • New BackendRouter in backend_router.rs selects origin URL based on request host and path (domain index + path prefix/regex patterns)
  • Arena Group backend config lives in crates/fastly/backends.toml and is merged into the embedded binary config at build time by build.rs
  • trusted-server.toml contains only generic schema documentation
  • Synthetic cookie is only set on responses when not already present

Summary

Changes

File Change

Closes

Closes #317

Test plan

  • cargo test --workspace
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo fmt --all -- --check
  • JS tests: cd crates/js/lib && npx vitest run
  • JS format: cd crates/js/lib && npm run format
  • Docs format: cd docs && npm run format
  • WASM build: cargo build --bin trusted-server-fastly --release --target wasm32-wasip1
  • Manual testing via fastly compute serve
  • Other:

Checklist

  • Changes follow CLAUDE.md conventions
  • No unwrap() in production code — use expect("should ...")
  • Uses tracing macros (not println!)
  • New code has tests
  • No secrets or credentials committed

Copy link
Collaborator

@aram356 aram356 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Adds DataDome server-side bot validation and multi-backend routing. The backend router design is solid with good test coverage. However, the build.rs changes break environment variable overrides at build time, and several runtime issues need attention.

Findings

🔧 Must fix

  • build.rs uses from_toml instead of from_toml_and_env: env var overrides silently ignored at build time (build.rs:54)
  • rerun_if_changed misses integration env vars: Settings::default() has empty IntegrationSettings HashMap (build.rs:32)
  • Sampling hash has poor distribution: byte-sum hash produces identical values for IPs with same bytes in different order (datadome.rs:506)
  • validation_timeout_ms configured but never applied: misleading config field (datadome.rs:148)

❓ Questions

  • DataDome validation on every request: static assets (CSS/JS/images) all trigger a 200ms HTTP call — intentional? (publisher.rs:276)
  • backends.toml in open-source repo: ~175 production domains routing to a specific origin — should this be private?

🤔 Worth addressing

  • BackendRouter::new errors silently swallowed via .ok() (publisher.rs:237)
  • Regex patterns panic at request time, not startup (backend_router.rs:66)
  • Domain index silently overwrites on collision (backend_router.rs:120)
  • No tests for validate_request logic (~80 lines, zero coverage)

🌱 Future considerations

  • static OnceLock patterns prevent test isolation (publisher.rs:227, datadome.rs:751)
  • extract_host (datadome.rs:301) uses fragile trim_start_matches parsing when url crate is already a dependency. Also, split('/').next() on a non-empty string never returns None, so unwrap_or("api-js.datadome.co") is unreachable with a misleading hardcoded domain.

CI Status

  • Analyze (JS/TS): PASS
  • Rust build/test/clippy/fmt: Not run

// Merge base TOML with environment variable overrides and write output
let settings = settings::Settings::from_toml_and_env(&toml_content)
// For build time: use from_toml to parse with environment variables
let mut settings = settings::Settings::from_toml(&toml_content)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 from_toml() does not read environment variables — there's even a test proving this (test_from_toml_ignores_env_vars). The original build.rs used from_toml_and_env() to merge TRUSTED_SERVER__* env vars into the binary config. This silently breaks deployments that set secrets (API keys, origin URLs) via env vars at build time.

let mut settings = settings::Settings::from_toml_and_env(&toml_content)
    .expect("Failed to parse settings at build time");

println!("cargo:rerun-if-changed={}", BACKENDS_CONFIG_PATH);

// Create a default Settings instance and convert to JSON to discover all fields
let default_settings = settings::Settings::default();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 Settings::default() returns an empty IntegrationSettings HashMap, so collect_env_vars will never discover any TRUSTED_SERVER__INTEGRATIONS__* env vars. Changing e.g. TRUSTED_SERVER__INTEGRATIONS__DATADOME__SERVER_SIDE_KEY won't trigger a rebuild.

The old build.rs used an always-rebuild sentinel (_always_rebuild_sentinel_) specifically because integration keys can't be enumerated. Consider restoring that approach.

// Sample rate check: only validate sample_rate% of requests
if self.config.sample_rate < 100 {
// Use client IP hash for deterministic sampling
let hash = client_ip.bytes().fold(0u32, |acc, b| acc.wrapping_add(u32::from(b)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 This sums ASCII byte values — IPs like "1.2.3.4" and "4.3.2.1" produce identical hashes. IPs in the same /24 subnet cluster together since only the last octet differs. Sampling will be very non-uniform.

let hash = client_ip.bytes().fold(0u32, |acc, b| {
    acc.wrapping_mul(31).wrapping_add(u32::from(b))
});

/// Lower timeout = faster fail-open on `DataDome` API issues
#[serde(default = "default_validation_timeout_ms")]
#[validate(range(min = 50, max = 1000))]
pub validation_timeout_ms: u64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 This field is validated (50-1000ms range) but never applied — the comment at line 564 says Fastly doesn't support per-request timeouts. Users setting validation_timeout_ms = 100 would expect it to take effect. Either wire it into BackendConfig timeout support or remove the field to avoid misleading configuration.


// DataDome server-side validation (if enabled)
// This validates requests at the edge before they reach the origin
match crate::integrations::datadome::validate_request_server_side(settings, &req) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ This runs on every proxied request — CSS, JS, images, fonts all trigger a DataDome HTTP call (200ms timeout). Most bot protection validates only document/navigation requests. Is this intentional? If so, worth a comment explaining why.

fn compiled_regex(&self) -> Option<&Regex> {
if let (Some(regex_cell), Some(pattern)) = (&self.path_regex, &self.path_regex_pattern) {
Some(regex_cell.get_or_init(|| {
Regex::new(pattern).expect("path_regex pattern should be valid")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Invalid regex in config will expect() panic on the first matching request, not at startup. BackendRouter::new() doc says it returns errors for invalid patterns but doesn't actually validate them. Consider compiling the regex eagerly in the constructor and returning the error there.

for (idx, backend) in backends.iter().enumerate() {
for domain in &backend.domains {
let normalized = normalize_domain(domain);
domain_index.insert(normalized, idx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 If two backends claim the same domain (or domains that normalize to the same value), the second silently overwrites the first. Could cause hard-to-debug routing. Consider detecting duplicates and logging a warning.

/// Unique identifier for logging/debugging (optional).
#[serde(default)]
pub id: Option<String>,
/// Backend-specific timeouts (unused - for future use).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ ~30 lines of code for "unused - for future use". Consider removing until actually wired into BackendConfig.

/// assert_eq!(normalize_domain("sub.example.com"), "sub.example.com");
/// ```
#[must_use]
pub fn normalize_domain(domain: &str) -> String {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏ Allocates twice on every request (to_lowercase() then to_string() after trimming). Domains stored in the index are already normalized at construction time, but the incoming host is re-normalized on every call. Could use Cow<str> to avoid allocation when no transformation is needed.

pub struct BackendRoute {
pub origin_url: String,
pub certificate_check: bool,
#[allow(dead_code)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏ If the field is unused, consider removing it rather than suppressing the warning.

Copy link
Collaborator

@aram356 aram356 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Adds server-side DataDome bot validation at the edge and a multi-backend routing system that selects origins by domain/path. Both features default to disabled with safe fail-open behavior.

Findings

🔧 Blockers (P0)

  • build.rs no longer applies env var overrides: from_toml_and_env was changed to from_toml, so env vars like TRUSTED_SERVER__INTEGRATIONS__DATADOME__SERVER_SIDE_KEY are never merged. The rerun_if_changed env var enumeration is now dead code. (crates/common/build.rs:54)

🔧 High (P1)

  • Sampling hash has poor distribution: Byte-sum hash for IPv4 sampling produces collisions for IPs with same octets in different order. Use DefaultHasher or similar. (crates/common/src/integrations/datadome.rs:506)
  • Validation runs on every request: Static assets (images, CSS, JS, fonts) all trigger a blocking HTTP call to DataDome. Should filter to document/navigational requests only. (crates/common/src/publisher.rs:276)
  • Duplicate domain silently overwrites: If two backends declare the same domain, HashMap::insert keeps the last one with no warning. (crates/common/src/backend_router.rs:120)

🤔 Medium (P2)

  • validation_timeout_ms has no effect: Fastly Compute doesn't support per-request timeouts, but the field is validated and documented as if it works. Misleading for operators. (crates/common/src/integrations/datadome.rs:148)
  • Hardcoded fallback in extract_host (crates/common/src/integrations/datadome.rs:306): .unwrap_or("api-js.datadome.co") — if URL parsing fails, this silently uses a hardcoded domain instead of the configured api_origin. Should propagate the error or derive the fallback from config.
  • BackendTimeouts unused: Declared "for future use" — adds config surface with no current benefit. (crates/common/src/settings.rs:364)
  • Inconsistent cookie trimming: trim_start() in publisher.rs vs trim() in datadome.rs for the same pattern. (crates/common/src/publisher.rs:216)
  • No test coverage for validate_request: Sampling, cookie extraction, request building, response handling, and fail-open/closed logic are all untested. (crates/common/src/integrations/datadome.rs:751)

⛏ Low (P3)

  • 175 hardcoded domains in backends.toml: Customer-specific data baked into the binary. Consider whether this should come from a config store. (crates/fastly/backends.toml)

CI Status

  • Analyze (JS/TS): PASS
  • fmt: Not run
  • clippy: Not run
  • rust tests: Not run
  • js tests: Not run

// Merge base TOML with environment variable overrides and write output
let settings = settings::Settings::from_toml_and_env(&toml_content)
// For build time: use from_toml to parse with environment variables
let mut settings = settings::Settings::from_toml(&toml_content)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 P0 — Env var overrides no longer applied at build time

This changed from from_toml_and_envfrom_toml, which means environment variables like TRUSTED_SERVER__INTEGRATIONS__DATADOME__SERVER_SIDE_KEY are no longer merged at build time. The rerun_if_changed function below still enumerates env vars, but they're never applied — making it dead code.

Fix: Restore from_toml_and_env, or if this is intentional, remove the rerun_if_changed env var enumeration and document why env vars are no longer baked in.

// Sample rate check: only validate sample_rate% of requests
if self.config.sample_rate < 100 {
// Use client IP hash for deterministic sampling
let hash = client_ip.bytes().fold(0u32, |acc, b| acc.wrapping_add(u32::from(b)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 P1 — Sampling hash has poor distribution for IPv4

Simple byte-sum produces identical hashes for IPs with the same octets in different order (1.2.3.4 == 4.3.2.1). With % 100, sampling will be far from uniform.

Fix:

use std::hash::{Hash, Hasher};
let mut hasher = std::collections::hash_map::DefaultHasher::new();
client_ip.hash(&mut hasher);
let sample = (hasher.finish() % 100) as u8;


// DataDome server-side validation (if enabled)
// This validates requests at the edge before they reach the origin
match crate::integrations::datadome::validate_request_server_side(settings, &req) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 P1 — Validation runs on every request including static assets

validate_request_server_side is called unconditionally — images, CSS, JS, fonts all trigger a blocking HTTP call to DataDome (200ms default timeout). This:

  • Adds latency to every static asset
  • Multiplies DataDome API usage/cost
  • Could hit DataDome rate limits

DataDome validation is typically needed only for document/HTML requests.

Fix: Filter by request path or accept header to skip known static assets, or at minimum only validate navigational requests.

for (idx, backend) in backends.iter().enumerate() {
for domain in &backend.domains {
let normalized = normalize_domain(domain);
domain_index.insert(normalized, idx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 P1 — Duplicate domain silently overwrites

If two backends declare the same domain, HashMap::insert silently keeps the last one. This can cause hard-to-debug routing issues in production.

Fix: Log a warning or return an error:

if let Some(existing) = domain_index.insert(normalized.clone(), idx) {
    log::warn!("Domain '{}' appears in multiple backends (index {} and {})", normalized, existing, idx);
}

/// Lower timeout = faster fail-open on `DataDome` API issues
#[serde(default = "default_validation_timeout_ms")]
#[validate(range(min = 50, max = 1000))]
pub validation_timeout_ms: u64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 P2 — validation_timeout_ms has no effect

This field is validated and documented, but the comment at line 564 says Fastly Compute doesn't support per-request timeouts. Operators will configure this expecting it to work.

Fix: Either remove the field, or log a warning at startup when it's set to a non-default value so operators aren't misled.

/// Backend-specific timeouts (unused - for future use).
#[serde(default)]
#[validate(nested)]
pub timeouts: BackendTimeouts,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 P2 — BackendTimeouts unused

The comment says "unused - for future use." This adds config surface area and serialization overhead with no current benefit. Per project conventions: don't design for hypothetical future requirements.

.map(|cookies| {
cookies
.split(';')
.any(|cookie| cookie.trim_start().starts_with(&synthetic_cookie_prefix))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 P2 — Inconsistent cookie trimming

trim_start() here vs trim() in datadome.rs:521 for the same cookie-parsing pattern. Standardize on trim() for robustness.

) -> Result<bool, Report<TrustedServerError>> {
// Cache the integration after the first call. Settings are fixed at startup,
// so first-caller-wins is correct — the same binary serves all requests.
static INTEGRATION: OnceLock<Option<Arc<DataDomeIntegration>>> = OnceLock::new();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 P2 — No test coverage for validate_request logic

The entire server-side validation flow — sampling, cookie extraction, request building, response status handling, fail-open/closed — has no unit tests. The #[cfg(test)] in get_client_ip shows testing was considered but the actual validation path is uncovered.

Adds two complementary features for edge-level request control:

DataDome server-side validation:
- New server_side_enabled config flag on DataDome integration
- validate_request_server_side() calls the DataDome HTTP API at the edge
  before proxying to origin, blocking detected bots with 403
- Configurable: API key, validation endpoint, timeout, fail-open behavior,
  and sample rate for gradual rollout
- Integration cached in OnceLock -- built once, reused across requests

Multi-backend routing:
- New BackendRouter in backend_router.rs selects origin URL based on
  request host and path (domain index + path prefix/regex patterns)
- Arena Group backend config lives in crates/fastly/backends.toml and is
  merged into the embedded binary config at build time by build.rs
- trusted-server.toml contains only generic schema documentation
- Synthetic cookie is only set on responses when not already present
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants